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, Recent studies attracted the attention on the inherent structure landscape (ISL) approach as a 

■ reduced description of proteins allowing to map their full thermodynamic properties. However, the 

04 ' analysis has been so far limited to a single topology of a two-state folding protein, and the simplifying 

^ \ assumptions of the method have not been examined. In this work, we construct the thermodynamics 

^ . of four two-state folding proteins of different sizes and secondary structure by MD simulations using 

^\ • the ISL method, and critically examine possible limitations of the method. Our results show that 

the ISL approach correctly describes the thermodynamics function, such as the specific heat, on a 
, qualitative level. Using both analytical and numerical methods, we show that some quantitative 

(N . limitations cannot be overcome with enhanced sampling or the inclusion of harmonic corrections. 
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PACS numbers: 87.15.A-, 87.15Cc, 05.70.-a 



I. INTRODUCTION 



. The biological and physical properties of proteins are compelling for many reasons. While just a small amount of 
^ the nowadays hundreds of thousands known protein sequences are experimentally characterized, the variety of their 
I ; functions is overwhelming. Though the structure has been resolved only for a subset of these sequences, the number 
d of stable folds that are expressed in nature is seemingly small compared to the number of sequences. The relationship 
^ ] between fold and function is far from obvious, and examples such as intrinsically unstructured proteins and multi- 
' functional folds resist simple schemes for classification. The question what really makes a protein functional hence 
needs to be addressed in the context of its specific biological environment. 

From a physical point of view, an attempt to find some unifying concepts for the interpretation of dynamics and 
thermodynamics is the description of proteins in term of energy landscapes , in which the evolution of the system 
is related to the dynamics on a high-dimensional rugged energy surface. The existence of local minima, connected by 
saddles of different barrier height and rank, leads to a distribution of timescales that are reflected in the dynamics of 
the proteins. 

' Although the energy landscape provides a reduced description, the complex interactions in proteins and their inter- 
' action with the environment, which involve multi-body interactions and subtle effects of charges, make its complete 
f — > characterization neither experimentally nor theoretically conceivable. This situation is somewhat reminiscent of other 
' complex systems such as glasses which, though being more homogeneous systems, share the property of displaying 
, a high-dimensional landscape leading to complex dynamics. The energy landscape picture is useful for a qualitative 
' analysis of protein properties, but for quantitative studies, an exhaustive sampling and and the full characteriza- 
. tion are practically infeasible for this high-dimensional representation. Therefore, in order to obtain comprehensive 
■ quantitative predictions on generic protein properties from the information on the landscape, the picture needs to 
^ ' be simplified. A recent work[l| has shown that a reduced description of the energy landscape, originally devised for 
' ^ the analysis of super-cooled liquids by Stillinger and Weber can successfully capture the essential thermodynamic 
aspects of folding in the context of a simplified protein model. In particular, it was shown that the density of states 
constructed from the local minima of the energy landscape, called inherent structures, can be used to compute the 
most important thermodynamic observables. This finding is important because it provides a general scheme for theo- 
retical studies of protein thermodynamics, showing how the relevant information can be quantitatively accessed from 
its imprint on the potential energy surface. The approach has consequently been used in the context of studies of 
the folding properties of a /3-barrel forming protein the construction of the free energy landscape by mechanical 
unfolding[5], and the network of native contactsj^. Other earlier works including inherent structure analysis but not 
necessarily seeking to characterize the full thermodynamics of folding can be found in 0, Q . 

However the validity of an analysis based on the inherent structure landscape (ISL) must be critically examined 
because the method involves a fundamental assumption which could be questioned: The vibrational free energy within 
the basin of attraction of an inherent structure is assumed to be independent of the basin. A recent study [q tried to 
go beyond this approximation by assuming that the vibrational free energy can depend on the energy of the inherent 
structures. Still, the question is subtle as we show in the present work that, even when the vibrational free energy 
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depends on the inherent structure energy, the derivation of thermodynamic quantities such as the specific heat from 
the ISL can be vahdly carried without any change in the procedure. Therefore an understanding of the hmits of the 
ISL approach requires a deeper analysis. This is the aim of the present work. 

We proceed in two steps. In a first step fSection lIII[) . after briefiy summarizing the ISL formalism and introducing 
the protein model used in Section [TTl we test the validity of the ISL approach by comparing its results to the data 
obtained from equilibrium molecular dynamics for a set of structures. We selected four previously unstudied two-state 
folding proteins of varying size and secondary structure elements. In a second step (Section IIVI) . we critically revisit 
the major hypotheses of the ISL approach, as well as its practical limitations, such as the sampling of the phase space, 
and suggest routes for improvements. Finally, we summarize our findings in Section|V]and give an outlook on possible 
future studies that stem from our results. 



II. METHODS 



A. Inherent structure analysis 

In this subsection, we briefly review the major results of [l, 9] on obtaining reduced thermodynamics from an 
analysis of the inherent structures. 

The method is general and not bound to a specific protein model, provided the phase space of the protein can be 
explored by molecular dynamics, and that the energies of the visited states can be calculated. From simulations at 
fixed temperature close to the folding temperature T/, which insures that the system evolves in a large part of the 
configuration space, the local potential minima, labelled by ai, are determined by conjugate gradient minimizations 
performed at fixed frequency along a molecular dynamics trajectory. The global minimum ao is defined as the 
reference ground state with zero energy. Let {xi}, i = 1,2, ...,3N, denote the 3A^ Cartesian coordinates of the A^- 
particle system, and V{{xi}) its potential energy function. The probability to find a particular minimum ai with 
potential energy Cq,. can be written as 

' Z{T) Jb(^^) Z{T) is(a.) ^ ' 

where AV^^ — V ~ Cq. , Z denotes the configurational part of the partition function and B{ai) is the basin of 
attraction of the minimmn a,-. With the definition 
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JB(ai) 



the unknown integral over the complex landscape of the basin of attraction B{ai) is summed in an free-energy like 
function F^[ai, T) which in principle depends both on the nature of the basin and temperature. Notice that although 
we use the index v like "vibrational", Fy(ai,T) is obtained from the full nonlinear integral over B(ai), and not from 
its harmonic approximation. 

The inherent structure landscape approach makes two key assumptions 1] which enable to considerably reduce the 
amount of information needed on the landscape while keeping its most important features. 

• (Al) The function Fy{a,T) for two minima ai,a2 that are distinct but close in energy, ea-^ ~ Gc^, is the same 
for both minima: Fy(ai,T) « Fy{a2,T). Consequently, Fy{a,T) « Fy{ea,T). 

• (A2) The function Fy{ea,T) does not vary significantly for different minima, i.e. Fy{ea,T) « Fy{T). 

Both assumptions were discussed in |9j. In section lTVl we show that assumption (A2) can actually be relaxed to the less 
strong form j3Fy{ea, T) « fv{ea) + pFy(0, T) while most calculations remain feasible and some of the thermodynamic 
variables unchanged. With these assumptions, the contribution from the function Fy factorizes in the numerator and 
denominator of ([1]) so that it can be eliminated to give 

where the sum in the partition function includes all inherent structures found from the global minimum ag to the 
minimum amax having the highest energy. Here, the energy scale is shifted such that the energy of the global 
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minimum ao is zero. Introducing an energy density function for the inherent structures rJ/s(e) = X]q=qo ~ ^q), 
the probability to find a minimum in the interval [cq., Cq + dea] at temperature T is 

i'/s(ea,T)de„ = -^r!7s(e„)e-^^= de„ , Zis ^ \ "^^^ de^^isifio)e-<^'- . (4) 

For the model used in this work, the low energy minima are in practice sparsely separated in energy. As the ground 
state is isolated, one obtains Zis{T) = 1/pq{T) with the probability of the ground state po{T), so that the inherent 
structure density of states can be estimated from the probability to be in the basin of attraction of a minimum in a 
fixed temperature simulation at temperature Tmd as 

^is{ea) = zPis{ea,TMD) ■ (5) 

Po{Taid) 

Though above we have chosen a continuous notation to simplify the equations, it should be noted that the density 
built from an estimate of the probability density function of sampled minima for the present model in practice 
always comprises discrete and continuous parts which can be integrated separately. Once the inherent structure 
density of states is known, one can compute the inherent structure partition function Zjs from from which all 
thermodynamics functions can be derived, including the free energy Fjs and the internal energy Uis- Given that most 
states of the system can be sampled close to the folding temperature, it is sufficient to simulate the system at a single 
temperature Tmd to construct the inherent structure landscape, in contrast to the full thermodynamics where one 
needs to sample different ranges of temperatures. Therefore, the ISL approach can be computationally very efficient. 
In the following, we will restrict ourselves to the computation of specific heat Cyjs being a quantity of fundamental 
importance in a physical system, as it is sensitive to fluctuations and, for instance, shows a clear signature of phase 
transitions. It can be deduced from numerical derivatives of the partition function Zis through 



and hence 



B. Model and selected proteins 



Since our goal is to analyze the validity of the ISL approach and not to derive quantitative data for a particular 
protein, wc decided to choose a simplified model, which allows the sampling of phase space at a reasonable compu- 
tational cost. However the model must be rich enough to properly describe the complex features of its physics, and 
should be able to distinguish between proteins which differ, for instance, in their secondary structure. We use frus- 
trated off-lattice Go-models identical to the ones introduced in [l| because they provide a good compromise between 
all-atom simulations and simplified models that do not fully describe the geometry of a protein. These models provide 
a representation with a single particle per residue centered at the location of each Cc-atom. For details on the model 
and the parameters, we refer to P, P] and a brief review in the appendix [XI Although the validity of such models to 
provide a faithful representation of protein folding is a recurrent subject of debate, off-lattice Go-models have been 
successfully used to study folding kinetics 10] and the mechanical resistance of proteins |TI|. From a physical point 
of the view, despite a strong bias towards the ground state, these models have a complex energy landscape with a 
large number of local minima well suited for the analysis in terms of inherent structures. 

As the results of the ISL approach depend on the density of states of the inherent structures, for a reliable test of 
the method it is important to examine examples which could differ in their properties, i.e. to investigate proteins of 
different size and structure. To test the inherent structure approach beyond the previously employed immunoglobulin 
(IG) binding domain of protein G (2GB1), we selected four two-state folding proteins of varying size and folds from 
the PDB databaseil2]: the trp-cagc mini-protein construct (1L2Y, 20 residues, a-helical), the ww domain FPB28 
(lEOL, 37 residues, /3-sheets), the src-SH3 domain (ISRL, 56 residues, /3-sheets) and ubiquitin (lUBQ, 76 residues, 
a — /3-fold). The motivation for these choices is discussed in Section [1111 for each protein (see insets of figures [T]|4] 
for structural representations of these four proteins drawn with pyMol[13]). The positions of the Ca-atoms of the 
PDB files is chosen as a reference for the construction of the Go-model. In the case of NMR resolved structures, 
the first structure is selected as the reference. The native contacts of the model were established according to the 
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FIG. 1: Results for src-SH3. Left: Inherent structure density of states Qis{ea)- The inset shows a close up on the low-energy 
range. The size of the energy bins for the density estimate is A i? = 0.2 ksTf. Right: Comparison of the specific heat 
from equilibrium trajectories Cv{T) (points), from which the specific heat of a harmonic oscillator in 3N dimensions has been 
subtracted, and Cv,is{T) from inherent structure analysis (solid red line); see text for further explanations. 



distances between atoms belonging to different residues. A native contact is formed if the shortest distance belonging 
to atoms of two different residues is smaller than 5.5^4. The number of native contacts according to this criterion 
are: Nnat = 91 for the ww domain, Nnat — 225 for ubiquitin, Nnat = 216 for src-SH3 and Nnat — 36 for trp-cage. 
This definition is simple and includes some arbitrariness. There exist other methods for probing contacts between 
side-chains, e.g. by invoking the van der Waals radii of residue atoms and solvent moleculesjl^. Though using the 
latter method preserves the main structure of the contact map, it leads to quantitative differences in the the number 
and location of contacts along the sequence. Consequently, one can expect that the topology of the energy surface 
and key thermodynamic properties such as the folding temperature are also altered when the definition of the contact 
map is varied. For the purpose of the present study which does not attempt to give a quantitative description of side 
chain contacts, and focusses on global properties of the landscape rather than its detailed relation to the network of 
contacts, the cutoff-based approach is acceptable. 

Molecular dynamics simulations were performed using the Brooks-Briinger-Karplus algorithm with a time-step 
of dt — 0.1 and a friction constant of 7 = 0.01,0.025 (all units in this section are dimensionless, see [il for details). 
To ensure equilibration, the system was thermalized starting from the native state (PDB coordinates) for t = 2 ■ 10^. 
The simulation time for a single temperature point and a single initial condition was t = 2- 10^, and the data obtained 
for both inherent structure sampling (fixed temperature) and thermodynamic sampling (variable temperature) were 
averaged over various initial sets of velocities. Minimization was performed using the conjugate gradient method with 
the Polak-Ribiere algorithm. To estimate the vibrational free energy at the minimum, mass weighted normal mode 
analysis was performed using LAPACK diagonalization routines. The second-order derivatives of the potential energy 
function at the minimum were calculated by numerical differentiation of the analytical first-order derivatives. 



III. REDUCED AND FULL THERMODYNAMICS OF A SET OF MODEL PROTEINS 



In this section, the validity of the ISL approach is tested by comparing the equilibrium thermodynamics deduced 
from molecular dynamics simulations to the reduced thermodynamics from inherent structure sampling. As discussed 
in Section |TT1 we evaluate the specific heat Cy as a function of temperature as a representative example of the 
thermodynamic observables. 



A. src-SH3 



The src-SH3 domain was chosen since it has the same number of residues as IG binding domain of protein G studied 
in [l|,[§], but contrasts to the latter in terms of structure. The src-SH3 domain is mostly composed of /3-sheets and does 
not contain an a-helical-secondary structure element (only five residues form a small right-handed helix segment). 
The inherent structure density of states, shown on the left-hand side of figure[l] was obtained from various simulations 
close to the folding temperature (Tmd ~ Tf) and built from « 72000 minima according to ([5]). After computing an 
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FIG. 2: Results for ubiquitin, see caption of figure \T\ for annotation. The size of the energy bins for the density estimate is 
A 0.24 ksTf. 



energy histogram using 1000 bins to yield an estimate of the inherent structure probability density, energy bins with 
only a single count have been discarded from the analysis to avoid a bias that could be introduced by insufRciently 
sampled isolated minima. The right hand side of figure [1] shows a comparison between the temperature dependence 
of the specific heat calculated from inherent structures, Cv,is{T), and the temperature dependence of the specific 
heat calculated from equilibrium molecular dynamics simulations at variable temperature Cv(T). The equilibrium 
thermodynamics has been determined by averaging the results of 10 initial conditions per temperature step except 
for points close to the transition where the results of 20 initial conditions have been used. Despite this averaging, the 
variance, indicated by error-bars on the y-axis, is large in the vicinity of the folding transition as the waiting time for 
a transition to occur becomes comparable to the simulation time. For a harmonic system, Cv{T) — NiiofkB/2 where 
Ndof is the number of degrees of freedom. At low temperatures T <^Tf, harmonic contributions are dominating, and 
the difference between Cy(r) and Cv,is{T) is approximately SNks, which is subtracted from Cv{T) in figure [TJ 

Figure [T] shows that the ISL approach is able to capture the main features of the thermodynamics of the Go-model 
of the src-SH3 domain. The value of the folding temperature is correctly determined by the ISL approach, but 
Cv,is underestimates the maximum by more than 20% if the highest point of Cv{T) is selected as a reference. This 
discrepancy at the maximum had previously also been observed for the inherent structure analysis of the IG binding 
domain of protein G [l|. On the other hand, towards higher temperatures, Cv,is{T) decays slower than Cv{T). 



B. ubiquitin 



Ubiquitin, with its 76 amino-acids, is a protein with a fairly rich secondary structure since it contains a a helix and 
5 (5 sheets. Similarly to the src-SHS domain, the ubiquitin Go-model presents a sharp folding transition associated 
with a large peak in the specific heat (see right-hand side of figure [2]). The specific heat Cy(T) was estimated from 
averages on 8 initial conditions per temperature step, and ^is{&a) was obtained using w 79000 minima from several 
independent trajectories close to the folding temperature using a histogram of 2000 bins. The agreement between the 
full thermodynamics and the ISL approach is better than for src-SH3, though similar trends of discrepancies can be 
observed. 



C. WW domain 



To contrast with sharp two-state transitions of protein G (56 residues), src-SH3 (56 residues) and ubiquitin (76 
residues), we selected smaller structures, the ww domain (37 residues) and the trp-cage (20 residues), to examine the 
performance of the inherent structure approach for less structured proteins, showing a broader transition. For such 
small protein domains, the validity of the Go-model can be questioned as the model is built from the geometrical 
structure of the folded state. For small molecules the discrimination between folded and unfolded states becomes subtle 
due to fiuctuations covering a large part of the accessible configurational space in a broad range of temperatures. The 
point in selecting these structures is not to asses the validity of the Go-model itself, but to test the ISL approach 
in very stringent cases to highhght possible limitations. The density of states ^is{ea) was obtained from k, 79000 
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FIG. 3: Results for the ww domain, see caption of figure [T] for annotation. The size of the energy bins for the density estimate 
is A £ = 0.05 kBTf. 




FIG. 4: Resuhs for trp-cage, see caption of figure [T] for annotation. The size of the energy bins for the density estimate is 
A £ = 0.02 ksTf. 



minima from several independent trajectories slightly above the folding transition (see figure [3]) using a histogram of 
2000 bins. In contrast to the two previous cases, the histogram of minima does not show a clear separation of basins 
of local minima associated to the folded/unfolded state. We observe a difference in the apparent shape of the density 
of states (left-hand side of figure[3]), which is globally concave in contrast to the convex densities obtained for src-SH3 
and ubiquitin. The same shape was also found for protein G in [ij, for which the relation between the concave shape 
and the two-hump structure of Pis{ea,T) was discussed in the vicinity of the folding temperature. Moreover, by 
comparing the insets of the left-hand side of figs. [T][3l one observes that the low energy range of ^ljs{ea) is more 
discrete, and states tend to lie less densely packed. 

The temperature dependence of Cy(T) was obtained by averaging over 12 initial conditions. An interesting feature 
of the curve is the shoulder in the low temperature range which indicates a partially unfolded structure associated 
to the breaking of a small number of contacts. Comparing the results of Cv{T) and Cv,is{T), it is apparent that 
though the specific heat reconstructed from inherent structure thermodynamics correctly captures the global shape of 
Cv{T), including the existence of the shoulder, important deviations can be observed. Similarly to the cases analyzed 
above, Cv,is{T) underestimates Cv{T) at lower temperatures while giving an overestimation at high temperatures. 
In contrast to the results for larger proteins, we also observe a significant shift of the transition temperature. 
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FIG. 5: ie/t- Comparison between the different inherent structure density of states Slis(ea) including the data for the IG 
binding domain of protein G from fl|; the color coding is: protein G (black), src-SH3 (green), ww domain (red), ubiquitin 
(blue), trp-cage (magenta). Right: Deviation between the specific heat obtained from full thermodynamics and inherent 
structure analysis. 



D. trp-cage 

With only 20 residues, the last protein fragment studied in this series is also the smallest, and mainly consists of 
a single a-helix. Its inherent structure density of states flisi^a) was estimated from w 90000 minima sampled from 
several independent trajectories close to the folding temperature using a histogram of 2000 bins. For such a small 
system, the low lying energy states are largely separated from each other, and the resulting density of states presents 
large gaps in a relatively broad range of energies. The continuum representation assumed for the inherent structure 
landscape is certainly questionable in such a case. The equilibrium thermodynamics were constructed from averages 
on 10 runs over different initial conditions. It is interesting to notice that the value of the specific heat computed 
with reduced thermodynamics is still fairly close to the actual specific heat, although we notice again that the peak 
is underestimated and a high temperature tail is observed as in the previous cases. In contrast to the results for the 
WW domain, the folding temperature of the trp-cage protein domain is correctly found by the analysis of inherent 
structures. 



E. Discussion 



Our studies of four proteins, combined with the earlier results on protein G [l|, |9|, allow us to describe some trends 
in the inherent structure analysis of Go-model proteins. 

For the density of states given by ([S]), a general exponential dependence, ^lisi^a) oc exp(— ea/fesTo) is observed for 
all proteins, with slightly different slopes for the low energy states, corresponding to states occupied in the folded 
configuration, and for the high energy states, occupied in the unfolded configuration. The value of Tq associated to 
the low energy range is a good estimate of the folding temperature, as previously reported for the case of protein G 
Figure [5] (left-hand panel), which compares the density of states of the inherent structures for the four proteins 
shows that, when being presented in reduced units as a function of ea/ksTf, the functional form of these densities 
is highly similar. For the large proteins that we studied, src-SH3 and ubiquitin (as well as protein G), the slope is 
slightly larger in the high energy range than in the low energy range. The converse is true for the small protein 
domains ww and trp-cage. A formal calculation of the reduced specific heat Cyjs from a bi-exponential density of 
inherent structure energies shows that this property is related to the sharpness of the folding transition. A density 
of states that is curved downwards for the energies associated to unfolded configurations leads to the broad folding 
transition expected for small protein domains. 

The calculation of the specific heat Cv.is shows that the ISL approach is able to determine the specific heat of a 
protein with reasonable accuracy, including the overall shape of the folding transition. To obtain such an agreement 
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the ground state probability Po{Tmd) must be sufficiently well sampled to ensure that the density of inherent states 
is correctly normalized. 

However, our studies of several proteins shows that limitations exist since systematic deviations from the full 
thermodynamics are apparent. At low temperatures, and up to the transition temperature, CyjsiT) underestimates 
the specific heat. The peak of Cv,is(T) is less pronounced than expected from the equilibrium trajectories, and tends 
to broaden towards higher temperatures [T > Tf) where Cv,is is larger than Cv{T). These deviation are shown in 
a comparative illustration in the right-hand side panel of figure [5] for the four proteins. Our data also reveals that 
the Go model, with its strong bias towards the native state, does not yield qualitatively differences depending on the 
secondary structure of the protein under consideration. 

Owing to the results found for various proteins which show systematic deviations from the results of equilibrium 
thermodynamics, it is important to examine the assumptions made in approximating the inherent structure of states, 
which we do in the following section. 

IV. THE LIMITATIONS OF THE ISL APPROACH 
A. Local normal mode analysis 

In Section|TT]we introduced the major assumptions (Al) and (A2) of the ISL approach. The derivation of thermody- 
namic quantities such as the specific heat is carried out as if the free energy contribution within a basin of attraction, 
defined by ([2]), did not depend on the particular inherent structure a^. It is difficult to test this assumption as it 
would in principle require the determination of the complete basin on the energy landscape, including the calculation 
of all the saddle points that determine the frontier of the basin as well as the shape of the basin within this frontier. 
Still, one can at least compute F^{a) in the harmonic approximation, as done also in [6]. 

Assuming that the contribution to the integral Q can be approximated by local normal modes in the vicinity of 
the energy minimum, the effective free energy can be written as 

, I \\ 37V-6 3Af-6 

l3FMMA{a,T) = = E logK(a)M(0)) - Y ^ogiksT/h^M) 

q=l \ B / ^^-^ 

= fNMA{a)+f3FNMA{0,T) , (8) 

where ujq{0) are the normal mode frequencies at the ground state a = 0. This expression allows us to calculate 
a harmonic approximation to Fy(a,T) by calculating the normal modes for each minimum that we sample, and 
subsequently summing their different contributions according to ([S]). Figure [H] shows the a-dependent part fNMA(o) 
as a function of the energy minimum for the two examples of src-SHS domain (left) and ww domain (right). The 
minima were obtained along a single trajectory close to the folding temperature. The distribution of the minima along 
the energy axis reflects the character of the probability distribution function for the two proteins, one being divided 
into two basins (src-SHS domain), the other being a single distribution (ww domain). 

As a first important observation, we note that the variation of /jv ma (q^) with Cq cannot be assumed to be negligible 
as assumed in (A2), according to which Fnma should be approximately constant in Cq. Both proteins show the same 
trend toward a decreasing effective free energy with increasing Cq. For the src-SHS domain, a nonlinear dependence 
can be observed in the high energy range in agreement with previously reported results on protein G [gI]. 
A second point to be noticed is that, for a given e^, a distribution of values of fNMA{oi) can be found (variance on 
the y-axis in figure [S]). While such a variance (or fluctuation) could be attributed to the limited numerical accuracy of 
the normal modes at high energies, this is also true for the low energy part for which the numerical scheme provides 
accurate results, as can be checked on the vanishing six lowest eigenmodes. Consequently, it appears that the first 
approximation (Al), i.e. F{a,T) « F{ea,T) does not hold as assumed, leaving the possibility that the observed 
deviations in section [TTTl could be caused by this simplification. 

B. Free energy correction 

The results of the previous subsection seem to challenge the validity of the ISL approach performed under the 
assumptions (Al) and (A2). At a first glance, the main problem seems to arise from the approximation (A2) that 
Fy{a, T) does not depend on the particular basin considered, which is obviously untrue. On the other hand, although 
there is clearly a variance associated to Fv for the same energy range Bq, in disagreement with approximation (Al), 
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FIG. 6: The frequency dependent component "^q^i ^'^si^q) of the vibrational free energy PFmma in the harmonic approxi- 
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one can observe a general evolution of Fy{a,T) with Cq, which suggests that approximating F„(a,T) by F„(eQ,T) 
may be acceptable. However, we show below that, if the free energy within a basin can be written as a sum 

PF4a,T) = /„(e„)+/3F40,r) , (9) 

with fv{ea = 0) := at the ground state, the calculation of Cy.is can he carried out without any change, so that 
approximation (A2) appearing to be to particularly bad at a first glance may not be the decisive one. It should be 
noticed that, as discussed in the previous section, the property ^ is verified if the motion in each basin of attraction 
can be described by a combination of harmonic vibrations. 

Starting again from ([T]), and proceeding as in section FlI Al we can eliminate the part of Fy(a,T) that depends on 
temperature only in the expression for the partition function and the probability distribution function, keeping only 
the a-dependent part. One gets 



Z{T) = e-''^-^"^^) j njs{ea)e-''^-e-f^^^^^ de^ , (10) 
Zis{T) = / nis{ec.)e-^'-e-f^^'-^deo. . (11) 



The principle of the calculation is to compute Zis{T) form a measurement of the probability density function Pis{T) 
estimated from MD simulations at a given temperature Tmd- This can be achieved through the intermediate calcula- 
tion of a density of states of inherent structures U,is{ea), which is temperature independent and from which Zis{T) 
can be obtained at all temperatures. In equation (jlip . we notice that the inclusion of the term e"-'^"*-'^"^ is equivalent 
in definition an "effective" density of states ri7s(eQ,)e~'^'''-'^°-' from which the classical thermodynamic expression can 
be derived as shown below. In the calculation of section III A[ the density of states is given by equation ^ . This 

density 17 and all other observable calculated in section Hi Al will henceforth denote with and index (0). In the new 
scheme including the a-dependent part of the free energy in the harmonic approximation, the probability density 
Pis{&a,T) becomes 

Pisiec.T) = ^-l-^f],5(e„)e-^^°e-/"(^") . (12) 



yielding the inherent structure density 



f^/5(e.) = Mf|lZi££)e^M.-e/"(-) . (13) 
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The latter expression shows that if the variation of Fy{a, T) with cannot be ignored, the previously derived density 
of states fij^^(ea) is not the correct one. The two are related by 

nisie^) = r!fj(e„)e^"(^°) . (14) 

A similar result was also reported in [G] which considered the particular case of a piecewise linear dependence on Cq. 
Though the densities differ, substituting (fT4| in equations (fTTj) or ([4]), we immediately have Zis{T) = Z°g(r), and the 

inherent structure observables such as Uis ~ (bq) and Cv,is = {{Sa) ~ i^a)^) /ksT^ are unchanged, i.e. Uis — U^^J 

and Cv,is — C^v\s ^ven when /u 7^ 0. 

As a consequence, using equation (jlOp . the full free energy F{T) of the protein can be written as 

F(T) = -fcBTlog(Z(T)) = -fcBTlog(Z/s(T))+F„(0,T) 

= -kBT\og{Z%\T)) + Fy{0,T) = Fi°J +Fy{Q,T) . (15) 

Therefore, in the ISL formalism, taking into account the variation of Fy{a, T) as in equation Q does not alter the free 
energy and cannot be expected to be at the origin of the quantitative differences between Cy derived from the ISL 
formalism and the full numerical results presented in section IIIII We conclude that the origin of these discrepancies 
is likely to be found in the non-separability between the a and the T dependency within the basins. Such an non- 
separability can be expected as soon as the anharmonicity of the different basins is taken into account. This is certainly 
relevant for proteins, in particular as the denaturation involves frequent transitions between basins of different shape 
and volume associated to the semi-rigid folded and the highly flexible unfolded state. 

When Fy{pL,T) cannot be separated into a- and T-dependent contributions to simplify the calculation of the partition 
function, the remaining possible approximation that can tackle the computational difficulty would be the saddle 
point approximation of the free energy . This approximation is acceptable in the thermodynamic limit for large 
systems, but cannot be justified in the present problem as the number of particles involved is still small and the 
interactions between particles are heterogeneous. While the harmonic approximation seems to be invalid for the 
present problem which involves large conformational changes due to the denaturation transition, results on super- 
cooled liquids [Tt'I indicate that the correction of the heterogeneity of the basins at low temperatures is small and the 
decoupling approximation of vibrational and inherent structure contributions appears to be possible at least in these 
temperature regimes. 



C. Effect of limited sampling efficiency 

The main practical difficulty of the ISL method comes from the need to properly sample all the inherent structures 
in order to get a meaningful density of states ^is{Ga) in all energy ranges. In the present scheme of inherent structure 
sampling, a single temperature is selected to simulate the dynamics of the protein for a finite period of time. The 
choice of a temperature close to the folding temperature is natural as the protein samples both the folded and the 
unfolded configurations at this temperature. There are however two questions that have to be answered: i) How 
long should we follow a protein MD trajectory to get a sufficient sampling? ii) Is it possible to combine data from 
simulations at a few different temperatures instead of keeping T fixed? 

Let us first analyze the effect of the sampling time. It should be chosen long enough to cover the slowest intrinsic 
timescale of the system, and various trajectories with different initial conditions or realizations of the thermostat 
should be used to ensure that the order of events does not alter the shape of the distribution. An estimate of the time 
range that sampling must cover is provided by the folding/unfolding time of the protein, to guarantee that the protein 
explores both configurational subspaces. A possible check of the choice of the simulation time is to compute the 
specific heat with an increasing number of samples, and stop when the improvement brought by additional samples is 
negligible. Still, as computer time is limited, it cannot be ensured that all relevant states are sufficiently well sampled 
to yield a converged probability density. In particular, the high energy minima are sampled only with low probability, 
such that the high energy cut-off in the density of states is likely to be underestimated in finite time sampling. In 
this section, we analyze the impact of this cutoff on a model density to see how the inherent structure specific heat 
is possibly affected. 

To analyze the effect of the sampling independently of a particular case, let us assume a " model" inherent structure 
density of states taken as a single exponential 

^ - J ^""^'^ < < Bmax 

i^/sie„j-< e <e 
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FIG. 7: Le/t; The specific heat Cv.is as a function of temperature and the cutoff parameter emax- Right: Location of the 
maxima of Cv,is for different values of the cutoff emax- 

similar to the shape of the densities that can be found in limited ranges of energies for the numerical results in section 
mil The partition function than can be readily calculated as 



Likewise, we can calculate the first two moments (ca) and (e^) to find the specific heat as a function of the temperature, 
the parameter a and cutoff in energy emax 



and analyze the result graphically as a function of the energy cutoff Cmax in figure [71 As can be inferred from left-hand 
side of figure [71 max^ [Cvjs{T', a, emax)] increases with higher cut-off emax- Using symbolical computation [l^, we 
can further inspect the result to find the local maxima of the the specific heat for fixed cutoff emax - On the right-hand 
side of figure [7l we observe that a lower cutoff in the density of states shifts the maximum of the specific heat towards 
higher temperatures. In addition to the shift, the curve becomes broader and the value at the maximum decreases. 
This situation is similar to the physical scenario when protein folding is altered by confinement (see e.g. [Toj). The 
high energy states disappear from the density of states as the system is prohibited to explore these by external forces. 
For the present purpose of the inherent structure analysis, although derived for a highly idealized model density, the 
results indicate that insufficient sampling at high energies can significantly alter the global shape of the transition. 
We have checked that these conclusions remain unchanged for a piecewise constant density of states. For instance, 
for the WW domain, one finds emax/o. ~ 60 which is higher than the range of values for which a shift of the maximum 
can be expected from figure [71 Consequently, the origin of this shift cannot be attributed to inefficient sampling in 
the high energy range. 

Because the high energy minima are less frequently visited, it is tempting to try to sample the minima from a high 
temperature molecular dynamics trajectory. On the other hand, as the method relies on the probability to occupy 
the ground state which determines l/Zjs{T), it is also necessary to properly sample the ground state, i.e. to select 
a simulation temperature which is below Tf- To reconcile these two exclusive conditions, one solution is to combine 
results sampled at two different temperatures to calculate ri/s(ea), which should be temperature independent. This 
is possible because, according to ([3]) 




(16) 



el) ~ {eo.? 



(17) 



Pjs(ec,ri) ^ Zis{T2) 



g-(/3l-/32)e, 



(18) 



with Pi^2 = l/(fcs'7i.2), so that the ratio of Zjs{T2)/ Zis{Ti) can be calculated from the probabilities to occupy a 
basin at temperatures Ti and T2- A molecular dynamics trajectory obtained at a temperature Ti < Tf can be used to 
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determine Zis{Ti) from the probability to occupy the ground state, and subsequently a second simulation at a higher 
temperature T2 can sample high energy basins more efficiently. For all basins which are properly sampled in both 
molecular dynamics runs, the ratio Zjs{T2) / Zis{Ti) can be evaluated with (fT8|) . Although it should not depend on 
the particular basin that was used for its calculation, this ratio actually fluctuates around a mean value which can be 
used to determine Zjs{T2) from Zjs(Ti). Then ([3]), applied at the higher temperature T2, can be used to compute 
^isi^a) in the high energy range. Moreover, in the intermediate energy range, the basins are properly sampled by 
the two trajectories at temperatures Ti and T2, which gives two ways to evaluate ^is{ea) for those basins, and thus 
provides a way to check the consistency of the method. 




FIG. 8: Results from the sampling of inherent structures of the ww domain at two different temperatures Ti = 1.03 Tf 
(~ 79000 minima, red) and T2 = 1.41 Tf (« 18000 minima, blue) Left: Probability of occupation of the inherent structures 
versus e^. Right: Density of inherent states energy (in logarithmic scale), calculated from the simulation at Ti (red) and 
calculated at T2 using the ratio ZisiT^) j ZisiT\). 

Figure [5] shows the results for the ww domain from different molecular dynamics simulations, simulated respectively 
at Ti/Tf = 1.03 and T^jTf — 1.41. For the simulations at T2, the ground state is not sampled in the finite interval 
of time of the simulations, but $7/5 can nevertheless be obtained through the evaluation of Zjs(T2) deduced from 
Zjg{Ti) = l/po{Ti) for all values of for which the histograms of the left-hand side of figure [S] overlap. The right- 
hand side of figure [5] shows that the values of the density of inherent state energies computed from data at Ti and 
T2 are in rather good agreement in the whole energy range where the basins are sampled at the two temperatures. 
There is however a discrepancy between the two results, with a systematic deviation towards lower values of ^isi^a) 
for high inherent state energies when the density of states is calculated with data sampled at high temperatures. This 
is counterintuitive as one could tend to believe that sampling the basins at low temperatures would, on the contrary, 
underestimate the density of basins at high energy. This systematic deviation, which has been observed in all our 
calculations could point out a limitation of the ISL method as presently applied, i.e. with a calculation which is only 
valid if PFy{a, T) is the sum of a term that depends on and a term that depends on temperature (see (O ). When 
the specific heat is calculated with the value of ^is{&a) extended in the high energies range with this method, the 
function is still very close to the value obtained with only the data at temperature Ti and the agreement with the 
numerical value of Cv{T) is not improved. This shows that the discrepancy between the exact results and those 
deduced from the ISL approach are not due to technical difficulties such as an insufficient sampling in some energy 
range, but that they are rather inherent to the method itself together with its assumptions. 

V. DISCUSSION 

We applied the inherent structure landscape (ISL) approach to four different proteins of varying size and secondary 
structure elements using a coarse-grained off-lattice protein model, and calculated their inherent structure density 
of states. Using these densities, we derived the specific heat from the reduced inherent structure thermodynamics, 
and compared it to the value obtained from equilibrium molecular dynamics as a function of temperature. Our 
results show that the ISL approach can correctly capture the shape of the temperature dependence of the specific 
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heat, including some characteristic features such as the hump observed at T w 0.4T/ for the ww domain. This is 
remarkable since the result is deduced from molecular dynamics simulations at a single temperature, close to Tf, and 
nevertheless predicts the main features of the specific heat in a large temperature range, including low temperatures 
for which very long simulations would be necessary to reach exhaustive sampling of the phase space. This shows 
that many features of protein thermodynamics are encoded in the inherent structure landscape. Still, the approach is 
not perfect as we observed quantitative differences between Cv.is{T) and Cv{T) which are particularly significant 
for small protein domains. The deviations show a systematic trend, the specific heat being underestimated below Tf 
and overestimated above. This lead us to reexamine the approximations that enter the construction of the reduced 
thermodynamics from inherent structures for the model that we considered. The first approximation assumes that 
the correction to the density of states due to the structure of the basin associated to a minimum /„ (a) depends on 
the energy level of the minimum only, and not of the individual minimum. An evaluation of fv{ct) using a harmonic 
approximation based on local normal modes (section IIV A[) shows that this is only approximately correct. For a 
given Ca, the values of /t,(a) are actually distributed around an average value, with fluctuations that grow for higher 
values of Ca and that appear to be larger for small proteins. This could explain some of the discrepancies between 
the ISL results and the equilibrium data. Moreover, the calculation of fv{o) indicates that the second assumption 
that the correction can be considered to be a-independent is certainly not valid. However, we showed that if [3F^{a) 
splits into a temperature-dependent and an a-dependent part, which is the case in the harmonic approximation, most 
of the thermodynamic results deduced from a direct application of the ISL approach are not affected. This is true 
in particular for the the specific heat. In view of the persisting quantitative differences between reduced inherent 
structure and equilibrium thermodynamics, we therefore conclude that the correction of the free energy in term of 
a harmonic approximation is not sufficient. It is likely that the nonlinear terms in the free energy associated to a 
basin cannot be ignored, and play a significant role. This is not surprising because, especially in the high temperature 
range, the protein fluctuates by exploring many basins, and consequently cannot be assumed to be well described by 
a harmonic approximation. 

In future studies, it would be useful to analyze the role of the structure of the full basin on the thermodynamic results 
beyond the approximation by local normal modes around the minima. This is a true challenge owing to the complexity 
of the energy landscape. A starting point for such a study might be the examination of the distribution of first rank 
saddle points associated to the different minima on the potential energy surface. A second aspect which is suggested 
by the present work is to apply the ISL approach for protein folding in the context of more complex energy landscapes 
that arise in more realistic potential energy functions. The results on the small proteins analyzed in this study show 
that the global separation of the probability density into two basins associated to folded and unfolded states is not 
a necessary requirement to construct the reduced thermodynamics. It appears therefore likely that the formalism 
remains useful in cases where the energy landscape is less biased towards the ground state than in the Go-model. An 
application of the ISL method to other protein models therefore appears to be desirable and promising. 
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APPENDIX A: MODEL HAMILTONIAN AND PARAMETERS 

In this work, we analyze the properties of an off-lattice Go-type [l|, Q in which the smallest building unit is a single 
amino acid. Effective interactions between amino acids are based on the reference positions of the Ca-carbons of each 
residue. These interactions are "color-blind" in the sense that they do not distinguish between the type of amino 
acids. The potential energy of the system comprises five terms: 
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Here, denotes the Euclidean distance between residues i,j. The 3N — 6 degrees of freedom of the system are most 
conveniently expressed via internal coordinates: TV — 1 Euclidean bond distances di along the backbone, N — 2 bond 
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angles formed between two consecutive bond directions, and A''— 3 dihedral angles measured between the normal vectors 
of planes spanned by atoms {i, i + i + 2) and (i + l, i+2, j+3). Bond distance and bond angle interactions are modelled 
through harmonic forces with coupling strengths Kh and Ka respectively. The zero indices indicate that the quantities 
(angles, distances) are evaluated in the reference state, i.e., the position from the NMR/crystallographic structure. 
The dihedral angle potential does not assume a minimum in the reference position defined by the experimentally 
resolved structure: it favors angles close to tt/A and 37r/4 irrespective of the secondary structure element (helix, sheet, 
turn) the amino acid belongs to. While such values can be found in a-helices, they statistically do not appear largely 
in other secondary structure elements. As a consequence, the reference state defined by an NMR or crystallographic 
structure can give rise to a competition between the helix for which the reference state is close to its minimum 
energy, and other parts which experience forces due to the angular constraint. The constraint was introduced as a 
source of additional "frustration" affecting the dynamics and thermodynamics of the model towards a more realistic 
representation [T]. Attractive non-bonded native interactions are modelled by steep (12,10)-Lennard-Jones potentials 
accounting for non-local contacts between residues due to side chains. The database of contacts is established as 
described in sectionllll In addition, non-native repulsive interactions are added to those residues which do not form a 
contact and which lie at least four residues apart. The dimensionless parameters used in this study are [H: i^f, = 200.0, 
Ka = 40.0, Kd = 0.3, e = 0.18, C = 4.0. 
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